Train Random Forest Classifier for Engagement Prediction and Find Most Important Features

A Random Forest classifier will be trained to predict whether a user remained engaged. The data set is restricted to users whose tweets were classified by the text classifier, because we want to include the percentage of pro-protester tweets as a feature.

Resources: CS109 Homework #5

Imports



In [2]:

    
import numpy as np
import scipy as sp
import pandas as pd
import sklearn
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

Bring in data: text classification for users (pickle file), August tweets (CSV file), and November tweets (CSV file).



In [128]:

    
#users with classified tweets
user_tc = pd.read_pickle('final_aug_percents.pkl')



In [129]:

    
#count is the number of tweets with hashtag #Ferguson or #ferguson
#perc_p is the percent of the user's tweets that have been classified as 'pro-protester'
user_tc.head()









    Out[129]:

  user.screen_name  count    perc_p
0       ItriSamele     11  0.727273
1       gohogsgirl     28  0.857143
2          averrer     17  0.823529
3      17147578976     15  1.000000
4  fucking_ninoX_X     12  0.833333



In [5]:

    
#load august_reduced_all for data for Aug 10 - 17
df_aug = pd.read_csv('/home/data/aug_reduced_all.csv')









    



/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1154: DtypeWarning: Columns (0,2,3,4,6,7,10,11,12,14,15,17) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
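
The DtypeWarning above recommends specifying dtype on import. A minimal sketch of that idea follows, using a made-up in-memory CSV rather than the real August file. The rows, the choice of column, and the use of pd.to_numeric (available only in pandas 0.17 and later) are assumptions for illustration, not the notebook's own method.

```python
import io
import pandas as pd

# Tiny made-up CSV with one non-numeric entry in a count column.
raw = io.StringIO(u"user.screen_name,user.followers_count\na,10\nb,oops\na,30\n")

# Reading the column as text avoids per-chunk type guessing,
# which is what triggers the mixed-types warning.
df = pd.read_csv(raw, dtype={'user.followers_count': str})

# Coerce to numbers; entries that cannot be parsed become NaN.
df['user.followers_count'] = pd.to_numeric(df['user.followers_count'], errors='coerce')

# mean() skips NaN, so user 'a' averages 10 and 30.
followers = df.groupby('user.screen_name', as_index=False)['user.followers_count'].mean()
```

Coercing explicitly makes it visible which rows held unparseable values, at the cost of an extra step after loading.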



In [45]:

    
list(df_aug.columns.values)









    Out[45]:





['Unnamed: 0',
 'id',
 'lang',
 '_iso_created_at_x',
 'text',
 'user.id_x',
 'user.screen_name',
 'user.geo_enabled',
 'user.statuses_count',
 'user.friends_count',
 'user.lang',
 'user.name',
 'user.following',
 'user.followers_count',
 'retweeted',
 'in_reply_to_screen_name',
 'retweet_count',
 'entities.user_mentions',
 '_iso_created_at_y',
 'user.id_y',
 'retweeted_status.user.id',
 'retweeted_status.favorite_count',
 'retweeted_status.favourities_count',
 'retweeted_status.user.id.1',
 'retweeted_status.user.followers_count',
 'retweeted_status.user.friends_count',
 'in_reply_to_status_id',
 'retweeted_status.in_reply_to_status_id',
 'retweeted_status.in_reply_to_user_id']



In [7]:

    
df_nov = pd.read_csv('/home/data/nov_reduced.csv')









    



/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1154: DtypeWarning: Columns (0,1,2,3,5,6,7,8,9,10,11,13,14,16) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)



In [8]:

    
df_nov.head()









    Out[8]:

                   id lang           _iso_created_at                                               text  \
0  538844861919928320   en  2014-11-30T00:00:00.000Z  @ainsleyearhardt #ferguson doesn't deserve #Da...
1  538844861181722624   en  2014-11-30T00:00:00.000Z  Darren Wilson has resigned from the Ferguson p...
2  538844860866760705   en  2014-11-30T00:00:00.000Z  RT @RT_com: BREAKING: Darren Wilson resigns in...
3  538844858832912384   en  2014-11-30T00:00:00.000Z  .@BMUatYale ..."Which of #MyBrothersKeeper gon...
4  538844856278196224   en  2014-11-29T23:59:59.000Z  RT @TheAnonMessage: BREAKING: Darren Wilson re...

     user.id user.screen_name user.geo_enabled  user.statuses_count  user.friends_count user.lang  ...  \
0   24520375     PathFlounder            False                 3617                 160        en  ...
1   18956073       dcexaminer             True                84607               12042        en  ...
2   22842540        mickey228            False                69514                 424        ja  ...
3   79838661          ugotGod            False                96307                  71        en  ...
4  511004909    Manny_Fresh21             True                27037                 606        en  ...

                              entities.user_mentions  retweeted_status.user.id  retweeted_status.favorite_count  retweeted_status.favourities_count  \
0  [ { "id" : 186646579, "indices" : [ 0, 16 ], "...                       NaN                              NaN                                 NaN
1                                                 []                       NaN                              NaN                                 NaN
2  [ { "id" : 64643056, "indices" : [ 3, 10 ], "i...                  64643056                              120                                 NaN
3  [ { "id" : 1716317970, "indices" : [ 1, 11 ], ...                       NaN                              NaN                                 NaN
4  [ { "id" : 423662810, "indices" : [ 3, 18 ], "...                 423662810                              149                                 NaN

   retweeted_status.user.id.1  retweeted_status.user.followers_count  retweeted_status.user.friends_count  \
0                         NaN                                    NaN                                  NaN
1                         NaN                                    NaN                                  NaN
2                    64643056                                 814889                                  451
3                         NaN                                    NaN                                  NaN
4                   423662810                                  98591                                  298

   in_reply_to_status_id  retweeted_status.in_reply_to_status_id  retweeted_status.in_reply_to_user_id
0           5.388415e+17                                     NaN                                   NaN
1                    NaN                                     NaN                                   NaN
2                    NaN                                     NaN                                   NaN
3                    NaN                                     NaN                                   NaN
4                    NaN                                     NaN                                   NaN

5 rows × 26 columns



In [9]:

    
nov = df_nov[['user.screen_name', '_iso_created_at']]



In [10]:

    
nov_df = pd.DataFrame({'count' : nov.groupby( [ "user.screen_name"] ).size()}).reset_index()



In [11]:

    
nov_df.head()









    Out[11]:

  user.screen_name  count
0          000120o     43
1     000Dillon000     15
2  000RowanPark000     25
3           000kek      6
4    007JamesBong_     11

Prepare features in August data

The following features will be used:

average number of friends
average number of followers
maximum number of total tweets
number of replies in the #F/ferguson tweets
number of retweets in the #F/ferguson tweets
percentage of #F/ferguson tweets that were replies
percentage of #F/ferguson tweets that were retweets
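
The reply and retweet features in the list above can be sketched on a toy table. This is only an illustration under stated assumptions: the rows are made up, and `count` is taken here as the number of rows per user, whereas the notebook reads it from the text-classification pickle. A tweet is treated as a reply when `in_reply_to_screen_name` is present and as a retweet when `retweeted_status.user.id` is present.

```python
import pandas as pd

# Toy stand-in for the August tweet table; the rows are made up.
tweets = pd.DataFrame({
    'user.screen_name': ['a', 'a', 'a', 'b', 'b'],
    'in_reply_to_screen_name': ['x', None, None, None, None],
    'retweeted_status.user.id': [None, 7.0, 9.0, None, 3.0],
})

grouped = tweets.groupby('user.screen_name')

# count() ignores missing values, so it counts replies and retweets per user.
per_user = pd.DataFrame({
    'count': grouped.size(),
    'total_replies': grouped['in_reply_to_screen_name'].count(),
    'total_retweets': grouped['retweeted_status.user.id'].count(),
})

# Shares of each user's tweets that were replies or retweets.
per_user['pct_replies'] = per_user['total_replies'] / per_user['count']
per_user['pct_retweets'] = per_user['total_retweets'] / per_user['count']
```

A single grouped pass like this avoids filtering the full tweet table once per user.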



In [ ]:

    
#group by screen name and get average friends
fr_df = df_aug[['user.screen_name', 'user.friends_count']].dropna(how='all')
friends = fr_df.groupby(['user.screen_name'], as_index=False).mean()



In [ ]:

    
#group by screen name and get average followers
fo_df = df_aug[['user.screen_name', 'user.followers_count']].dropna(how='all')
fo_df['user.followers_count'] = fo_df['user.followers_count'].convert_objects(convert_numeric=True)
followers = fo_df.groupby(['user.screen_name'], as_index=False).mean()



In [ ]:

    
#group by screen name and get max of total tweets
tt_df = df_aug[['user.screen_name', 'user.statuses_count']].dropna(how='all')
tt_df['user.statuses_count'] = tt_df['user.statuses_count'].convert_objects(convert_numeric=True)
total_tweets = tt_df.groupby(['user.screen_name'], as_index=False).max()



In [38]:

    
#restrict august dataframe to users in user_tc dataframe
aug = df_aug[(df_aug['user.screen_name'].isin(user_tc['user.screen_name']))].reset_index()

#http://stackoverflow.com/questions/12096252/use-a-list-of-values-to-select-rows-from-a-pandas-dataframe



In [61]:

    
#get reply count for each user
for i in range(0, len(user_tc)):
    count = aug[aug['user.screen_name'] == user_tc.loc[i, 'user.screen_name']]['in_reply_to_screen_name'].count()
    user_tc.loc[i, 'total_replies'] = count



In [64]:

    
#get retweet count for each user
for i in range(0, len(user_tc)):
    count = aug[aug['user.screen_name'] == user_tc.loc[i, 'user.screen_name']]['retweeted_status.user.id'].count()
    user_tc.loc[i, 'total_retweets'] = count



In [69]:

    
#calculate %retweets and %replies for the tweets with the #F/ferguson hashtag
user_tc['pct_replies'] = user_tc['total_replies'] / user_tc['count']
user_tc['pct_retweets'] = user_tc['total_retweets'] / user_tc['count']



In [70]:

    
user_tc.head()









    Out[70]:

  user.screen_name  count    perc_p  total_replies  total_retweets  pct_replies  pct_retweets
0       ItriSamele     11  0.727273              0               0     0.000000      0.000000
1       gohogsgirl     28  0.857143              1               8     0.035714      0.285714
2          averrer     17  0.823529              0               0     0.000000      0.000000
3      17147578976     15  1.000000              0              14     0.000000      0.933333
4  fucking_ninoX_X     12  0.833333              0               7     0.000000      0.583333



In [72]:

    
#merge all feature dataframes together
features = user_tc.merge(friends, on = 'user.screen_name', how = 'inner').merge(followers, on='user.screen_name', how='inner').merge(total_tweets, on='user.screen_name', how='inner')

#code help from http://stackoverflow.com/questions/23668427/pandas-joining-multiple-dataframes-on-columns



In [73]:

    
features.head()









    Out[73]:

  user.screen_name  count    perc_p  total_replies  total_retweets  pct_replies  pct_retweets  user.friends_count  user.followers_count  user.statuses_count
0       gohogsgirl     28  0.857143              1               8     0.035714      0.285714         1072.777778            724.666667         16987.777778
1      17147578976     15  1.000000              0              14     0.000000      0.933333          155.785714             15.000000           144.928571
2  fucking_ninoX_X     12  0.833333              0               7     0.000000      0.583333          311.000000            384.000000          9957.000000
3         76stephc     17  0.882353              0               5     0.000000      0.294118          237.800000             40.200000          1148.400000
4      erinisinire     14  0.857143              0               8     0.000000      0.571429          723.700000            312.500000          5412.800000

Determine if users remained engaged

We define a user as having remained engaged if they tweeted 10 or more times with the hashtag #F/ferguson during Nov. 25 - Dec. 1.
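
The labelling rule can be sketched on two small made-up tables. Only the column names follow the notebook; the users and counts are invented. A left merge keeps every August user, and a user who is missing from the November counts ends up with a missing count and is labelled as not engaged.

```python
import pandas as pd

# Made-up per-user tables; only the column names follow the notebook.
aug_users = pd.DataFrame({'user.screen_name': ['a', 'b', 'c'], 'count': [28, 15, 12]})
nov_counts = pd.DataFrame({'user.screen_name': ['a', 'c'], 'count': [2, 10]})

# Left merge keeps all August users; the two 'count' columns
# are renamed count_x (August) and count_y (November).
merged = aug_users.merge(nov_counts, on='user.screen_name', how='left')

# A missing November count compares as False against the threshold,
# so users absent in November are labelled 0.
merged['rem_eng'] = (merged['count_y'] >= 10).astype(int)
```

The vectorized comparison gives the same labels as a row-by-row loop with an if/else on `count_y`.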



In [74]:

    
#merge the august features dataframe with the november dataframe
aug_nov = pd.merge(features, nov_df, on='user.screen_name', how='left')



In [109]:

    
#determine if user remained engaged (1 = yes, 0 = no)
for i in range(0, len(aug_nov)):
    if aug_nov.loc[i, 'count_y'] >= 10:
        aug_nov.loc[i, 'rem_eng'] = 1
    else:
        aug_nov.loc[i, 'rem_eng'] = 0



In [110]:

    
aug_nov.head()









    Out[110]:

  user.screen_name  count_x    perc_p  total_replies  total_retweets  pct_replies  pct_retweets  user.friends_count  user.followers_count  user.statuses_count  count_y  rem_eng
0       gohogsgirl       28  0.857143              1               8     0.035714      0.285714         1072.777778            724.666667         16987.777778        2        0
1      17147578976       15  1.000000              0              14     0.000000      0.933333          155.785714             15.000000           144.928571      NaN        0
2  fucking_ninoX_X       12  0.833333              0               7     0.000000      0.583333          311.000000            384.000000          9957.000000        9        0
3         76stephc       17  0.882353              0               5     0.000000      0.294118          237.800000             40.200000          1148.400000      NaN        0
4      erinisinire       14  0.857143              0               8     0.000000      0.571429          723.700000            312.500000          5412.800000      NaN        0

Calculate % remaining engaged in this sample.



In [111]:

    
rem_eng = len(aug_nov[aug_nov['rem_eng'] == 1])
not_rem_eng = len(aug_nov[aug_nov['rem_eng'] == 0])
print(rem_eng*1.0 / (rem_eng + not_rem_eng))









    



0.31844316674
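
Because rem_eng is coded as 0 or 1, the same share can also be read off as the mean of the label. With roughly 32% of users engaged, a model that always predicts "not engaged" would already score about 68% accuracy, which is a useful baseline when reading the accuracy plots later in the notebook. A small sketch with made-up labels (the array below is hypothetical, not the notebook's data):

```python
import numpy as np

# Hypothetical 0/1 labels standing in for aug_nov['rem_eng'].
labels = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0])

# The mean of a 0/1 array is the fraction of 1s.
engaged_share = labels.mean()

# Accuracy of always predicting the more common class.
majority_baseline = max(engaged_share, 1 - engaged_share)
```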

Prepare X and Y data sets

Prepare the X and Y data sets for the Random Forest classifier.



In [112]:

    
#convert rem_eng to numpy array called Y
Y = np.array(aug_nov.rem_eng)



In [113]:

    
#drop user.screen_name
features_d = features.drop('user.screen_name', axis = 1)
features_d = features_d.fillna(0)
features_d.head()









    Out[113]:

   count    perc_p  total_replies  total_retweets  pct_replies  pct_retweets  user.friends_count  user.followers_count  user.statuses_count
0     28  0.857143              1               8     0.035714      0.285714         1072.777778            724.666667         16987.777778
1     15  1.000000              0              14     0.000000      0.933333          155.785714             15.000000           144.928571
2     12  0.833333              0               7     0.000000      0.583333          311.000000            384.000000          9957.000000
3     17  0.882353              0               5     0.000000      0.294118          237.800000             40.200000          1148.400000
4     14  0.857143              0               8     0.000000      0.571429          723.700000            312.500000          5412.800000



In [114]:

    
#convert features to a numpy array
X = features_d.values

Features key:

count: number of tweets the user made between Aug. 10 and Aug. 17 with the hashtag #F/ferguson
perc_p: the percent of the user's tweets that were classified by the text classifier as 'pro-protester'
total_replies: number of tweets the user made between Aug. 10 and Aug. 17 with the hashtag #F/ferguson that were replies
total_retweets: number of tweets the user made between Aug. 10 and Aug. 17 with the hashtag #F/ferguson that were retweets
pct_replies: percent of tweets in count that were replies
pct_retweets: percent of tweets in count that were retweets
user.friends_count: average number of user's friends between Aug. 10 and Aug. 17
user.followers_count: average number of user's followers between Aug. 10 and Aug. 17
user.statuses_count: total number of user's lifetime tweets as of their last tweet in the Aug. data set

Train Random Forest Classifier and Generate Score Plots

A Random Forest Classifier will be trained, and score plots will be generated to help choose the best number of trees.
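
For readers on a recent scikit-learn, where `cross_val_score` lives in `sklearn.model_selection`, the sweep over forest sizes can be sketched as follows. The synthetic data from `make_classification`, the three forest sizes, the 5 folds, and the fixed `random_state` are all assumptions chosen to keep the example small; they are not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and Y with 9 features, like the notebook's feature table.
X_toy, y_toy = make_classification(n_samples=200, n_features=9, random_state=0)

# Score a few forest sizes with 5-fold cross validation.
tree_counts = [1, 5, 10]
scores = []
for n in tree_counts:
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores.append(cross_val_score(clf, X_toy, y_toy, cv=5))

# One mean accuracy per forest size.
score_means = np.mean(scores, axis=1)
```

Each entry of `scores` holds the per-fold accuracies for one forest size, which is the shape the box plots are drawn from.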



In [117]:

    
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score

#calculate cross val scores for each random forest
scores = []
for i in range(1, 31):
    rfc = RandomForestClassifier(n_estimators=i)
    score = cross_val_score(rfc, X, Y, cv=10)
    scores.append(score)

#calculate mean score for each random forest
score_means = np.mean(scores, axis=1)
trees = np.arange(30)+1

plt.figure(figsize=(15,8))
plt.scatter(trees, score_means, c='k', zorder=2)
sns.boxplot(scores)
plt.xlabel("Number of Trees")
plt.ylabel("Cross Validation Score")
plt.title("Cross Validation Score vs. Number of Trees")
plt.show()

16 trees looks good. Now we'll try with F1 scores.



In [91]:

    
#calculate F1 scores for each random forest
scores_f1 = []
for i in range(1, 26):
    rfc = RandomForestClassifier(n_estimators=i)
    score = cross_val_score(rfc, X, Y, cv=10, scoring='f1')
    scores_f1.append(score)

#calculate mean F1 score for each random forest
score_means_f1 = np.mean(scores_f1, axis=1)
trees = np.arange(25)+1

plt.figure(figsize=(15,8))
plt.scatter(trees, score_means_f1, c='k',zorder=2)
sns.boxplot(scores_f1)
plt.xlabel("Number of Trees")
plt.ylabel("F1 Score")
plt.title("Cross Validation Score using F1 Scoring vs. Number of Trees")
plt.show()

16 trees still looks good, but we know this data set is unbalanced, so we'll try with a custom cut-off.
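
The idea behind a custom cut-off can be sketched independently of the forest. By default a classifier labels a user as engaged when the predicted probability of class 1 exceeds 0.5. Lowering the cut-off labels more users as engaged, which tends to raise recall on the minority class at the cost of precision. The probability array below is hypothetical and `predict_with_cutoff` is an illustrative name, not a scikit-learn function.

```python
import numpy as np

# Hypothetical predict_proba output: one row per user,
# columns are P(class 0) and P(class 1).
prob = np.array([[0.9, 0.1],
                 [0.7, 0.3],
                 [0.4, 0.6]])

def predict_with_cutoff(prob, cutoff):
    # Label a user 1 when the class-1 probability exceeds the cutoff.
    return (prob[:, 1] > cutoff).astype(int)

default_labels = predict_with_cutoff(prob, 0.5)
lowered_labels = predict_with_cutoff(prob, 0.2)
```

With the lower cut-off, the second user moves from class 0 to class 1 while the other two labels stay the same.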



In [92]:

    
def cutoff_predict(clf, X, cutoff):
    #generate prediction probabilities
    prob = clf.predict_proba(X)
    
    #convert probabilities to predictions
    clf_p = np.empty(len(prob))
    for i in range(len(prob)):
        if prob[i][1] > cutoff:
            clf_p[i] = 1
        else:
            clf_p[i] = 0
    return clf_p

#code from Homework #5



In [93]:

    
from sklearn.metrics import f1_score

def custom_f1(cutoff):
    #build a scorer that computes F1 after thresholding probabilities at cutoff
    def f1_cutoff(clf, X, y):
        ypred = cutoff_predict(clf, X, cutoff)
        return f1_score(y, ypred)
        
    return f1_cutoff

#code from Homework #5

#set range of cutoffs
cutoff_range = np.arange(0.1, 0.9, 0.1)

#set up Random Forest Classifier
rfc_c = RandomForestClassifier(n_estimators=15)

#calculate custom F1 scores for each cutoff value
scores_cc = []
for i in cutoff_range:
    score_cc = cross_val_score(rfc_c, X, Y, cv=10, scoring = custom_f1(i))
    scores_cc.append(score_cc)
    
plt.figure(figsize=(15,8))
sns.boxplot(scores_cc, names=cutoff_range)
plt.xlabel("Cutoff Value")
plt.ylabel("F1 Score (using custom cutoff value)")
plt.title("Cross Validation Score with F1 Scoring vs. Cutoff Value")
plt.show()



In [100]:

    
#calculate custom-cutoff F1 scores (cutoff = 0.2) for each random forest
scores_f1 = []
for i in range(1, 26):
    rfc = RandomForestClassifier(n_estimators=i)
    score = cross_val_score(rfc, X, Y, cv=10, scoring=custom_f1(0.2))
    scores_f1.append(score)

#calculate mean F1 score for each random forest
score_means_f1 = np.mean(scores_f1, axis=1)
trees = np.arange(25)+1

plt.figure(figsize=(15,8))
plt.scatter(trees, score_means_f1, c='k',zorder=2)
sns.boxplot(scores_f1)
plt.xlabel("Number of Trees")
plt.ylabel("F1 Score")
plt.title("Cross Validation Score using Custom F1 Scoring (Cutoff = 0.2) vs. Number of Trees")
plt.show()

With the custom cutoff F1 score, 6 trees seems optimal. The feature importances will be generated using 6 trees.

Determine importance of features



In [127]:

    
#set up and fit a random forest classifier
rfc_id = RandomForestClassifier(n_estimators=6)
rfc_id.fit(X, Y)
#calculate feature importances
importances = rfc_id.feature_importances_

index = np.arange(len(importances))
plt.bar(index, importances)
plt.xticks(index+0.5,features_d.columns.values, rotation=75)
plt.show()

The most important features are:

the tweet count, which is the number of tweets the user made between Aug. 10 and Aug. 17 with the hashtag #F/ferguson
the total number of tweets in the user's lifetime, as of the last tweet in our August dataset
the user's followers count
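
One way to turn the bar chart into a ranked list is to pair each importance with its column name and sort. The sketch below uses synthetic data in which only one column drives the label, so it illustrates the mechanics rather than reproducing the ranking above. The column names, the data, and the fixed `random_state` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic features: the label depends only on 'signal'.
toy = pd.DataFrame({
    'signal': rng.rand(200),
    'noise_a': rng.rand(200),
    'noise_b': rng.rand(200),
})
y = (toy['signal'] > 0.5).astype(int)

# Six trees, matching the forest size chosen in the text.
clf = RandomForestClassifier(n_estimators=6, random_state=0).fit(toy.values, y)

# Pair importances with column names and sort from most to least important.
ranked = pd.Series(clf.feature_importances_, index=toy.columns).sort_values(ascending=False)
print(ranked)
```

The importances sum to 1, so each value can be read as a share of the forest's total impurity reduction. With only 6 trees the ranking can shift between runs unless `random_state` is fixed.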

Show Pair Plots

The pair plots for the features were generated, using the rem_eng (remained engaged) value as the hue. These plots require further analysis to identify interesting trends.



In [115]:

    
#prep dataframe for pair plots
all_aug_nov = aug_nov.drop('user.screen_name', axis = 1)
all_aug_nov = all_aug_nov.drop('count_y', axis = 1)



In [87]:

    
all_aug_nov.head()









    Out[87]:

   count_x    perc_p  total_replies  total_retweets  pct_replies  pct_retweets  user.friends_count  user.followers_count  user.statuses_count  rem_eng
0       28  0.857143              1               8     0.035714      0.285714         1072.777778            724.666667         16987.777778        1
1       15  1.000000              0              14     0.000000      0.933333          155.785714             15.000000           144.928571        0
2       12  0.833333              0               7     0.000000      0.583333          311.000000            384.000000          9957.000000        1
3       17  0.882353              0               5     0.000000      0.294118          237.800000             40.200000          1148.400000        0
4       14  0.857143              0               8     0.000000      0.571429          723.700000            312.500000          5412.800000        0



In [88]:

    
sns.pairplot(all_aug_nov, hue='rem_eng')









    Out[88]:





<seaborn.axisgrid.PairGrid at 0x7fce93980410>
